Abstract

In this paper, we propose to learn temporal embeddings of video frames for complex video analysis. Large quantities of unlabeled video data can be easily obtained from the Internet. These videos possess the implicit weak label that they are sequences of temporally and semantically coherent images. We leverage this information to learn temporal embeddings for video frames by associating frames with the temporal context that they appear in. To do this, we propose a scheme for incorporating temporal context based on past and future frames in videos, and compare this to other contextual representations. In addition, we show how data augmentation using multi-resolution samples and hard negatives helps to significantly improve the quality of the learned embeddings. We evaluate various design decisions for learning temporal embeddings, and show that our embeddings can improve performance for multiple video tasks such as retrieval, classification, and temporal order recovery in unconstrained Internet video.

1. Introduction 

Video data is plentiful and a ready source of information - what can we glean from watching massive quantities of videos? At a fine granularity, consecutive video frames are visually similar due to temporal coherence. At a coarser level, consecutive video frames are visually distinct but semantically coherent.

Learning from this semantic coherence present in video at the coarser level is the main focus of this paper. Purely from unlabeled video data, we aim to learn embeddings for video frames that capture semantic similarity by using the temporal structure in videos. The prospect of learning a generic embedding for video frames holds promise for a variety of applications ranging from generic retrieval and similarity measurement, video recommendation, to automatic content creation such as summarization or collaging. In this paper, we demonstrate the utility of our video frame embeddings for several tasks such as video retrieval, classification and temporal order recovery.

Figure 1. The temporal context of a video frame is crucial in determining its true semantic meaning. For instance, consider the above example where the embeddings of different semantic classes are shown in different colors. The middle frames from the two wedding videos correspond to visually dissimilar classes of “church ceremony” and “court ceremony”. However, by observing the similarity in their temporal contexts we expect them to be semantically closer. Our work leverages such powerful temporal context to learn semantically rich embeddings.

The idea of leveraging sequential data to learn embeddings in an unsupervised fashion is well explored in the Natural Language Processing (NLP) community. In particular, distributed word vector representations such as word2vec [20] have the unique ability to encode regularities and patterns surrounding words, using large amounts of unlabeled data. In the embedding space, this brings together words that may be very different, but which share similar contexts in different sentences. This is a desirable property we would like to extend to video frames as well, as shown in Fig. 1. We would like to have a representation for frames which captures the semantic context around the frame beyond the visual similarity obtained from temporal coherence.

However, the task of embedding frames poses multiple challenges specific to the video domain: 1. Unlike words, the set of frames across all videos is not discrete and quantizing the frames leads to a loss in information; 2. Temporally proximal frames within the same video are often visually similar and might not provide useful contextual information; 3. The correct representation of context surrounding a frame is not obvious in videos. The main contribution of our work is to propose a new ranking-loss-based embedding framework, along with a contextual representation specific to videos. We also develop a well-engineered data augmentation strategy to promote visual diversity among the context frames used for embedding.

We evaluate our learned embeddings on the standard tasks of video event retrieval and classification on the TRECVID MED 2011 [25] dataset, and compare to several recently published spatial and temporal video representations [5, 30]. Aside from semantic similarity, the learned embeddings capture valuable information in terms of the temporal context shared between frames. Hence, we also evaluate our embeddings on two related tasks: 1. temporal frame retrieval, and 2. temporal order recovery in videos. Our embeddings improve performance on all tasks, and serve as a powerful representation for video frames.

2. Related Work 

Video features. Standard tasks in video such as classification and retrieval require a well-engineered feature representation, with many proposed in the literature [1, 6, 11, 18, 22, 23, 24, 26, 29, 37, 36]. Deep network features learned from spatial data [8, 12, 30] and temporal flow [30] have also shown comparable results. However, recent works in complex event recognition [38, 41] have shown that spatial Convolutional Neural Network (CNN) features learned from ImageNet [2] without fine-tuning on video, accompanied by suitable pooling and encoding strategies, achieve state-of-the-art performance. In contrast to these methods, which either propose handcrafted features or learn feature representations with a fully supervised objective from images or videos, we try to learn an embedding in an unsupervised fashion. Moreover, our learned features can be extended to other tasks beyond classification and retrieval.

There are several works which improve complex event recognition by combining multiple feature modalities [10, 21, 3 ]. Another related line of work is the use of sub-events defined manually [5], or clustered from data [17] to improve recognition. Similarly, Yang et al. used low dimensional features from deep belief nets and sparse coding [39]. While these methods are targeted towards building features specifically for classification in limited settings, we propose a generic video frame representation which can capture semantic and temporal structure in videos.

Unsupervised learning in videos. Learning features with unsupervised objectives has been a challenging task in the image and video domain [7, 19, 34]. Notably, [19] develops an Independent Subspace Analysis (ISA) model for feature learning using unlabeled video. Recent work from [< ] also hints at a similar approach to exploit the slowness prior in videos. Also, recent attempts extend such autoencoder techniques for next frame prediction in videos [28, 32]. These methods try to capitalize on the temporal continuity in videos to learn an LSTM [40] representation for frame prediction. In contrast to these methods, which aim to provide a unified representation for a complete temporal sequence, our work provides a simple yet powerful representation for independent video frames and images.

Embedding models. The idea of embedding words to a dense lower dimension vector space has been prevalent in the NLP community. The word2vec model [20] tries to learn embeddings such that words with similar contexts in sentences are closer to each other. A related idea in computer vision is the embedding of text in the semantic visual space attempted by [3, 15] based on large image datasets labeled with captions or class names. While these methods focus on different scenarios for embedding text, the aim of our work is to generate an embedding for video frames.


3. Our Method 

Given a large collection of unlabeled videos, our goal is to leverage their temporal structure to learn an effective embedding for video frames. We wish to learn an embedding such that the context frames surrounding each target frame can determine the representation of the target frame, similar to the intuition from word2vec [20]. For example, in Fig. 1, context such as “crowd” and “cutting the cake” provides valuable information about the target “ceremony” frames that occur in between. This idea is fundamental to our embedding objective and helps in capturing semantic and temporal interactions in video.

While the idea of representing frames by embeddings is appealing, the extension from language to visual data is not straightforward. Unlike language, we do not have a natural, discrete vocabulary of words. This prevents us from using a softmax objective as in the case of word2vec [20]. Further, consecutive frames in videos often share visual similarity due to temporal coherence. Hence, a naive extension of [20] does not lead to good vector representations of frames.

To overcome the problem of lack of discrete words, we use a ranking loss which explicitly compares multiple pairs of frames across all videos in the dataset. This ensures that the context in a video scores the target frame higher than others in the dataset. We also handle the problem of visually similar frames in temporally smooth videos through a carefully designed sampling mechanism. We obtain context frames by sampling the video at multiple temporal scales, and choosing hard negatives from the same video.




Figure 2. Visualizations of the temporal context of frames used in: (a) our model (full), (b) our model (no future), and (c) our model (no temporal). Green boxes denote target frames, magenta boxes denote contextual frames, and red boxes denote negative frames.


3.1. Embedding objective 

We are given a collection of videos $V$, where each video $v \in V$ is a sequence of frames $\{s_{v1}, \ldots, s_{vn}\}$. We wish to obtain an embedding $f_{vj}$ for each frame $s_{vj}$. Let $f_{vj} = f(s_{vj}; W_e)$ be the temporal embedding function which maps the frame $s_{vj}$ to a vector. The model embedding parameters are given by $W_e$, and will be learned by our method. We embed the frames such that the context frames around the target frame predict the target frame better than other frames. The model is learned by minimizing the sum of objectives across all videos. Our embedding loss objective is shown below:

$$J(W_e) = \sum_{v \in V} \; \sum_{\substack{s_{vj} \in v \\ s_- \neq s_{vj}}} \max\left(0,\; 1 - (f_{vj} - f_-) \cdot h_{vj}\right), \qquad (1)$$

where $f_-$ is the embedding of a negative frame $s_-$, and the context surrounding the frame $s_{vj}$ is represented by the vector $h_{vj}$. Note that unlike the word vector embedding models in word2vec [20], we do not use an additional linear layer for softmax prediction on top of the context vector.
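
As an illustration of the ranking objective in Eq. (1), the following minimal sketch evaluates the hinge term for a single (target, negative, context) triple and sums it over a list of triples. The function names and the explicit `margin` argument are assumptions made for this example; the paper fixes the margin to 1 and does not prescribe an implementation.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def hinge_ranking_loss(f_target, f_negative, h_context, margin=1.0):
    """One term of Eq. (1): max(0, margin - (f_target - f_negative) . h_context)."""
    diff = [p - n for p, n in zip(f_target, f_negative)]
    return max(0.0, margin - dot(diff, h_context))


def total_loss(triples):
    """Sum of the hinge terms over (f_target, f_negative, h_context) triples."""
    return sum(hinge_ranking_loss(f, f_neg, h) for f, f_neg, h in triples)


# Toy 2-D example (hypothetical values): the context agrees with the target.
h = [1.0, 0.0]
f_pos = [2.0, 0.0]
f_neg = [0.5, 0.0]
loss_ok = hinge_ranking_loss(f_pos, f_neg, h)    # target beats negative by the margin
loss_bad = hinge_ranking_loss(f_neg, f_pos, h)   # roles swapped, so the term is positive
```

The term is zero once the context scores the target at least one margin above the negative, so only violating triples contribute to the loss.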

3.2. Context representation 

As we verify later in the experiments, the choice of context is crucial to learning good embeddings. A video frame at any time instant is semantically correlated with both past and future frames in the video. Hence, a natural choice for context representation would involve a window of frames centered around the target frame, similar to the skip-gram idea used in word2vec [20]. Along these lines, we propose a context representation given by the average of the frame embeddings around the target frame. Our context vector $h_{vj}$ for a frame $s_{vj}$ is:

$$h_{vj} = \frac{1}{2T} \sum_{t=1}^{T} \left( f_{vj+t} + f_{vj-t} \right), \qquad (2)$$

where $T$ is the window size, and $f_{vj}$ is the embedding of the frame $s_{vj}$. This embedding model is shown in Fig. 2(a). For negatives, we use frames from other videos as well as frames from the same video which are outside the temporal window, as explained in Sec. 3.4.

Two important characteristics of this context representation are that it 1. makes use of the temporal order in which frames occur and 2. considers contextual evidence from both past and future. In order to examine their effect on the quality of the learned embedding, we also consider two weaker variants of the context representation below.

Our model (no future). This one-sided contextual representation tries to predict the embedding of a frame in a video only based on the embeddings of frames from the past as shown in Fig. 2(b). For a frame $s_{vj}$, the context $h_{vj}^{\text{nofuture}}$ is given by:

$$h_{vj}^{\text{nofuture}} = \frac{1}{T} \sum_{t=1}^{T} f_{vj-t}, \qquad (3)$$

where $T$ is the window size.

Our model (no temporal). An even weaker variant of context representation is simple co-occurrence without temporal information. We also explore a contextual representation which completely neglects the temporal ordering of frames and treats a video as a bag of frames. The context $h_{vj}^{\text{notemp}}$ for a target frame $s_{vj}$ is sampled from the embeddings corresponding to all other frames in the same video:

$$h_{vj}^{\text{notemp}} \in \{ f_{vk} \mid k \neq j \}. \qquad (4)$$

This contextual representation is visualized in Fig. 2(c). 
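
The three context representations of this section (the full two-sided window, the past-only window, and the unordered bag of frames) can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, embeddings are plain lists of floats, and raising an error when the window does not fit inside the video is an assumed boundary policy that the text does not specify.

```python
import random


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def full_context(emb, j, T):
    """Average of the 2T embeddings at offsets +1..+T and -1..-T around frame j."""
    if j - T < 0 or j + T >= len(emb):
        raise ValueError("context window does not fit inside the video")
    future = [emb[j + t] for t in range(1, T + 1)]
    past = [emb[j - t] for t in range(1, T + 1)]
    return mean_vector(future + past)


def past_context(emb, j, T):
    """'No future' variant: average of the T embeddings preceding frame j."""
    if j - T < 0:
        raise ValueError("not enough past frames")
    return mean_vector([emb[j - t] for t in range(1, T + 1)])


def unordered_context(emb, j, rng):
    """'No temporal' variant: embedding of a random other frame of the same video."""
    k = rng.choice([i for i in range(len(emb)) if i != j])
    return emb[k]


# Toy video with five 1-D frame embeddings (hypothetical values).
emb = [[0.0], [1.0], [2.0], [3.0], [4.0]]
h_full = full_context(emb, 2, 2)
h_past = past_context(emb, 2, 2)
h_bag = unordered_context(emb, 2, random.Random(0))
```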

3.3. Embedding function 

In the previous sections, we introduced a model for representing context, and now move on to discuss the embedding function $f(s_{vj}; W_e)$. In practice, the embedding function can be a CNN built from the frame pixels, or any underlying image or video representation. However, following the recent success of ImageNet-trained CNN features for complex event videos [38, 41], we choose to learn an embedding on top of the fully connected fc6 layer feature representation obtained by passing the frame through a standard CNN [16] architecture. We use a simple model with a fully connected layer followed by a rectified linear unit (ReLU) and local response normalization (LRN) layer, with dropout regularization. In this architecture, the learned model parameters $W_e$ correspond to the weights and bias of our affine layer.
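
A rough sketch of the forward pass of such an embedding function (affine layer, then ReLU, then LRN) is given below. It is not the authors' Caffe model: the across-channel form of LRN and the values of `size`, `alpha`, `beta` and `k` are assumptions chosen for illustration, and dropout is omitted because it only acts during training.

```python
def affine(x, W, b):
    """Fully connected layer: one output per row of W, plus bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(x):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in x]


def lrn(x, size=5, alpha=1e-4, beta=0.75, k=1.0):
    """Across-channel local response normalization (assumed hyperparameters).

    Each unit is divided by (k + alpha / size * sum of squares in a window
    of `size` neighbouring units) raised to the power beta.
    """
    half = size // 2
    out = []
    for i, v in enumerate(x):
        lo, hi = max(0, i - half), min(len(x), i + half + 1)
        sq = sum(u * u for u in x[lo:hi])
        out.append(v / (k + alpha / size * sq) ** beta)
    return out


def embed(fc6, W, b):
    """Map an fc6 feature vector to an embedding: affine -> ReLU -> LRN."""
    return lrn(relu(affine(fc6, W, b)))


# Toy 2-D example with hypothetical weights.
out = embed([3.0, 1.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, -5.0])
```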

Figure 3. Multi-resolution sampling and hard negatives used in our full context model (T = 1). For a target frame (green), we sample context frames (magenta) at varying resolutions, as shown by the rows in this figure. We take hard negatives as examples in the same video that fall outside the context window (red).

3.4. Data augmentation 

We found that a careful strategy for sampling context frames and negatives is important to learning high quality embeddings in our models. This helps both in handling the problem of temporal smoothness and in preventing the model from overfitting to less interesting video-specific properties.

Multi-resolution sampling. Complex events progress at different paces within different videos. Densely sampling frames in slowly changing videos can lead to context windows comprised of frames that are visually very similar to the target frame. On the other hand, a sparse sampling of fast videos could lead to context windows only composed of disjoint frames from unrelated parts of the video. We overcome these problems through multi-resolution sampling as shown in Fig. 3. For every target frame, we sample context frames from multiple temporal resolutions. This ensures a good trade-off between visual variety and semantic relatedness in the context windows.
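
One way to picture multi-resolution sampling is to build, for each target frame, one context window per temporal stride. The sketch below does this under assumptions of our own: the stride set `(1, 2, 4)` and the rule of discarding windows that run past the video boundaries are illustrative choices, not values reported in the paper.

```python
def multi_resolution_contexts(j, num_frames, T, strides=(1, 2, 4)):
    """Return (stride, frame indices) for every stride whose window fits.

    For stride s the window holds frames j - T*s, ..., j - s, j + s, ..., j + T*s.
    """
    windows = []
    for s in strides:
        past = [j - t * s for t in range(T, 0, -1)]
        future = [j + t * s for t in range(1, T + 1)]
        idx = past + future
        if idx[0] >= 0 and idx[-1] < num_frames:
            windows.append((s, idx))
    return windows


# Target frame 10 in a 30-frame video with window size T = 2.
windows = multi_resolution_contexts(10, 30, 2)
```

Small strides give visually similar neighbours, while larger strides reach frames that are more distinct but still part of the same event, which is the trade-off described above.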

Hard negatives. The context frames, as well as the target to be scored, are chosen from the same video. If negatives are drawn only from other videos, this can cause the model to cluster frames from the same video based on less interesting video-specific properties such as lighting, camera characteristics and background, without learning anything semantically meaningful. We avoid such problems by choosing hard negatives from within the same video as well. Empirically, this improves performance for all tasks. The negatives are chosen from outside the range of the context window within a video as depicted in Fig. 3.
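
The negative sampling just described can be sketched as follows: hard negatives come from the same video but outside the context window, and the remaining negatives come from other videos. The function name, the argument layout, and the numbers of negatives are hypothetical; the paper does not state how many negatives of each kind are drawn.

```python
import random


def sample_negatives(j, num_frames, T, stride, num_hard, other_frames, num_easy, rng):
    """Return (hard, easy) negatives for target frame j.

    hard: frame indices of the same video lying outside the context window,
          i.e. more than T * stride frames away from j.
    easy: entries drawn from `other_frames`, a list of frames of other videos.
    """
    reach = T * stride
    outside = [k for k in range(num_frames) if abs(k - j) > reach]
    hard = rng.sample(outside, min(num_hard, len(outside)))
    easy = rng.sample(other_frames, min(num_easy, len(other_frames)))
    return hard, easy


# Toy example: frames of other videos are (video id, frame index) pairs.
rng = random.Random(0)
hard, easy = sample_negatives(10, 30, 2, 2, 3, [("v2", 0), ("v2", 1)], 1, rng)
```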

3.5. Implementation details 

The context window size was set to T = 2, and the embedding dimension to 4096. The learning rate was set to 0.01 and gradually annealed in steps of 5000. The training is typically completed within a day on 1 GPU with Caffe [9] for a dataset of approximately 40000 videos. All videos were first down-sampled to 0.2 fps before training. The embedding code as well as the learned models and video embeddings will be made publicly available upon publication.


4. Experimental Setup 

Our embeddings are aimed at capturing semantic and temporal interactions within complex events in a video, and thus we require a generic set of videos with a good variety of actions and sub-events within each video. Most standard datasets such as UCF-101 [31] and Sports-1M [12] are comprised of short video clips capturing a single sports action, making them unsuitable for our purpose. Fortunately, the TRECVID MED 2011 [25] dataset provides a large set of diverse videos collected directly from YouTube. More importantly, these videos are not simple single clip videos; rather they are complex events with rich interactions between various sub-events within the same video [5]. Specifically, we learn our embeddings on the complete MED11 DEV and TEST sets comprised of 40021 videos. A subset of 256 videos from the DEV and TEST set was used for validation. The DEV and TEST sets are typical random assortments of YouTube videos with minimal constraints.

We compare our embeddings against different video representations for three video tasks: video retrieval, complex event classification, and temporal order recovery. All experiments are performed on the MED11 event kit videos, which are completely disjoint from the training and validation videos used for learning our embeddings. The event kit is composed of 15 event classes with approximately 100 to 150 videos per event, with a total of 2071 videos.

We stress that the embeddings are learned in a completely unsupervised setting and capture the temporal and semantic structure of the data. We do not tune them specifically to any event class, and only about 0.3% of the videos in the DEV and TEST sets belong to each category. This is not unreasonable, since any large unlabeled video dataset is expected to contain a small fraction of videos from every event.

5. Video Retrieval 

In retrieval tasks, we are given a query, and the goal is 
to retrieve a set of related examples from a database. We 
start by evaluating our embeddings on two types of retrieval 
tasks: event retrieval and temporal retrieval. The retrieval 
tasks help to evaluate the ability of our embeddings to group 
together videos belonging to the same semantic event class 
and frames that are temporally coherent. 

5.1. Event retrieval 

In the event retrieval scenario, we are given a query video from the MED11 event kit and our goal is to retrieve videos that contain the same event from the remaining videos in the event kit. For each video in the event kit, we sort all other videos in the dataset based on their similarity to the query video using the cosine similarity metric, which we found to work best for all representations. We use Average Precision (AP) to measure the retrieval performance of each video and provide the mean Average Precision (mAP) over all videos in Tab. 1. For all methods, we uniformly sample 4 frames per video and represent the video as an average of the features extracted from them.

Method                         mAP (%)
Two-stream pre-trained [30]    20.09
fc6                            20.08
fc7                            21.24
Our model (no temporal)        21.92
Our model (no future)          21.30
Our model (no hard neg.)       24.22
Our model                      25.07

Table 1. Event retrieval results on the MED11 event kits.

The different baseline methods used for comparison are explained below:

• Two-stream pre-trained: We use the two-stream CNN from [30] pre-trained on the UCF-101 dataset. The models were used to extract spatial and temporal features from the video with a temporal stack size of 5.

• fc6 and fc7: Features extracted from the ReLU layers following the corresponding fully connected layers of a standard CNN model [16] pre-trained on ImageNet.

• Our model (no temporal): Our model trained with no temporal context (Fig. 2(c)).

• Our model (no future): Our model trained with no future context (Fig. 2(b)) but with multi-resolution sampling and hard negatives.

• Our model (no hard neg.): Our model trained without hard negatives from the same video.

• Our model: Our full model trained with multi-resolution sampling and hard negatives.
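
The event retrieval protocol of this section (average the frame features of each video, rank all other videos by cosine similarity, and report mAP) can be sketched as below. This is a minimal illustration with hypothetical function names and toy 2-D features; it uses the common definition of AP as the mean of the precision values at the ranks of the relevant items, which the paper does not spell out.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    num = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return num / (na * nb) if na > 0 and nb > 0 else 0.0


def video_feature(frame_features):
    """Represent a video by the average of its sampled frame features."""
    n = len(frame_features)
    return [sum(f[d] for f in frame_features) / n for d in range(len(frame_features[0]))]


def average_precision(ranked_relevance):
    """Mean of precision@rank over the ranks at which a relevant item appears."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(features, labels):
    """Use every video as a query against all remaining videos."""
    aps = []
    for q in range(len(features)):
        others = [i for i in range(len(features)) if i != q]
        others.sort(key=lambda i: cosine(features[q], features[i]), reverse=True)
        aps.append(average_precision([labels[i] == labels[q] for i in others]))
    return sum(aps) / len(aps)


# Four toy videos from two event classes (hypothetical features).
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
score = mean_average_precision(feats, labels)
```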

We observe that our full model outperforms other representations for event retrieval. We note that in contrast to most other representations trained on ImageNet, our model is capable of being trained with large quantities of unlabeled video which is easy to obtain. This confirms our hypothesis that learning from unlabeled video data can improve feature representations. While the two-stream model also has the advantage of being trained specifically on a video dataset, we observe that the learned representations do not transfer favorably to the MED11 dataset in contrast to fc7 and fc6 features trained on ImageNet. A similar observation was made in [38, 41], where simple CNN features trained from ImageNet consistently provided the best results.

Our embeddings capture the temporal regularities and patterns in videos without the need for expensive labels, which allows us to more effectively represent the semantic space of events. The performance gain of our full context model over the representation without temporal order shows the need for utilizing the temporal information while learning the embeddings.

Figure 4. t-SNE plot of the semantic space for (a) fc7 and (b) our embedding. The different colors correspond to different events.

Figure 5. t-SNE visualization of words from synopses describing MED11 event kit videos. Each word is represented by the average of our embeddings corresponding to the videos associated with the word. We show sample video frames for a subset of the words.

Visualizing the embedding space. To gain a better qualitative understanding of our learned embedding space, we use t-SNE [3 ] to visualize the embeddings in a 2D space. In Fig. 4, we visualize the fc7 features and our embedded features by sampling a random set of videos from the event kits. The different colors in the graph correspond to each of the 15 different event classes, as listed in the figure. Visually, we can see that certain event classes such as “Grooming an animal”, “Changing a vehicle tire”, and “Making a sandwich” enjoy better clustering in our embedded framework as opposed to the fc7 representation.

Another way to visualize this space is in terms of the actual words used to describe the events. Each video in the MED11 event kits is associated with a short synopsis describing the video. We represent each word from this synopsis collection by averaging the embeddings of videos associated with that word. The features are then used to produce a t-SNE plot as shown in Fig. 5. We avoid noisy clustering due to simple co-occurrence of words by only plotting words which do not frequently co-occur in the same synopsis. We observe many interesting patterns. For instance, objects such as “river”, “pond” and “ocean” which provide the same context for a “fishing” event are clustered together. Similarly, crowded settings such as “bollywood”, “military”, and “carnival” are clustered together. This provides a visual clustering of the words based on shared semantic temporal context.

Figure 6. The retrieval results for our embedding (middle two columns) and fc7 (last two columns). The first column shows the query frame and event, while the top 2 frames retrieved from the remaining videos are shown in the middle two columns for our embedding, and the last two columns for fc7. The incorrect frames are highlighted in red, and correct frames in green.

Event retrieval examples. We visualize the top frames retrieved for a few query frames from the event kit videos in Fig. 6. The query frame is shown in the first column along with the event class corresponding to the video. The top 2 frames retrieved from other videos by our embedding and by fc7 are shown in the middle two and last two columns for each query video, respectively.

We observe a few interesting examples where the query appears visually distinct from the results retrieved by our embedding. These can be explained by noting that the retrieved actions might co-occur in the same context as the query, which is captured by the temporal context in our model. For instance, the frame of a “bride near a car” retrieves frames of “couple kissing”. Similarly, the frame of “kneading dough” retrieves frames of “spreading butter”.

5.2. Temporal retrieval 

In the temporal retrieval task, we test the ability of our embedding to capture the temporal structure in videos. We sample four frames from different time instants in a video and try to retrieve the frames in between the middle two frames. This is an interesting task which has potential for commercial applications such as ad placements in video search engines. For instance, the context at any time instant in a video can be used to retrieve the most suited video ad from a pool of video ads, to blend into the original video.

Method                         mAP (%)
Two-stream pre-trained [30]    20.11
fc6                            19.27
fc7                            22.99
Our model (no temporal)        22.50
Our model (no future)          21.71
Our model (no hard neg.)       24.12
Our model                      26.74

Table 2. Temporal retrieval results on the MED11 event kits.

For this experiment, we use a subset of 1396 videos from the MED11 event kits which are at least 90 seconds long. From each video, we uniformly sample 4 context frames, 3 positive frames from in between the middle two context frames, and 12 negative distractors from the remaining segments of the video. In addition to the 12 negative distractors from the same video, all frames from other videos are also treated as negative distractors. For each video, given the 4 context frames we evaluate our ability to retrieve the 3 positive frames from this large pool of distractors.

We retrieve frames based on their cosine similarity to the 
average of the features extracted from the context frames, 
and use mean Average Precision (mAP) as before. We use 
the same baselines as the event retrieval task. The results 
are shown in Tab. 2. 

Our embedding representation, which is trained to capture temporal structure in videos, outperforms the other representations. This also shows its ability to capture long-term interactions between events occurring at different instants of a video.

Temporal retrieval examples. We visualize the top examples retrieved for a few temporal queries in Fig. 7. Here, we can see just how difficult this task is, as often frames that seem to be viable options for temporal retrieval are not part of the ground truth. For instance, in the “sandwich” example, our embedding wrongly retrieves frames of human hands to keep up with the temporal flow of the video.

6. Complex Event Classification 

The complex event classification task on the MED11 event kits is one of the more challenging classification tasks. We follow the protocol of [5, 27] and use the same train/test splits. Since the goal of our work is to evaluate the effectiveness of video frame representations, we use a simple linear Support Vector Machine classifier for all methods.

Unlike retrieval settings, we are provided labeled training instances in the event classification task. Thus, we fine-tune the last two layers of the two-stream model (pre-trained on UCF-101) on the training split of the event kits, and found this to perform better than the pre-trained model.

In addition to baselines from previous tasks, we also compare with [5], [19] and [27], with results shown in Tab. 3. Note that [5, 27] use a combination of multiple image and video features including SIFT, MFCC, ISA, and HOG3D. Further, they also use additional labels such as low-level events within each video. In Tab. 3, Izadinia et al. linear refers to the results without low-level event labels.

We observe that our method outperforms ISA [19], which is also an unsupervised neural network feature representation. Additionally, the CNN features trained from ImageNet seem to perform better than most previous feature representations, which is also consistent with the retrieval results and previous work [38, 41]. Among the methods, the two-stream model holds the advantage of being fine-tuned to the MED11 event kits. However, our performance gain could be attributed to the ability of our model to use large amounts of unlabeled data to learn better representations.

Figure 7. The retrieval results for our embedding model on the temporal retrieval task. The first and last 2 columns show the 4 context frames sampled from each video, and the middle 3 columns show the top 3 frames retrieved by our embedding. The correctly retrieved frames are highlighted in green, and incorrect frames highlighted in red.

Method                         mAP (%)
Two-stream fine-tuned [30]     62.99
ISA [19]                       55.87
Izadinia et al. [5] linear     62.63
Izadinia et al. [5] full       66.10
Raman, et al. [27]             66.39
fc6                            68.56
fc7                            69.17
Our model (no temporal)        69.57
Our model (no future)          69.22
Our model (no hard neg.)       69.81
Our model                      71.17

Table 3. Event classification results on the MED11 event kits.

Figure 9. An example of the temporal ordering retrieved by (a) fc7 and (b) our embedding for a “Making a sandwich” video. The indexes of the frames already in the correct temporal order are shown in green, and others in red.

7. Temporal Order Recovery


An effective representation for video frames should be able to not only capture visual similarities, but also preserve the structure between temporally coherent frames. This facilitates holistic video understanding tasks beyond classification and retrieval. With this in mind, we explore the video temporal order recovery task, which seeks to show how the temporal interactions between different parts of a complex event are inherently captured by our embedding.

In this task, we are given as input a jumbled sequence of frames belonging to a video, and our goal is to order the frames into the correct sequence. This has been previously explored in the context of photostreams [14], and has potential for use in applications such as album generation.

Solving the order recovery problem. Since our goal is to evaluate the effectiveness of various feature representations for this task, we use a simple greedy technique to recover the temporal order. We assume that we are provided the first two frames in the video and proceed to retrieve the next frame (third frame) from all other frames in the video. This is done by averaging the first two frames and retrieving the closest frame in cosine similarity. We go on to greedily retrieve the fourth frame using the average of the second and third frames, and continue until all frames are retrieved.
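
The greedy procedure described above can be sketched in a few lines. The function name and the toy features are assumptions for illustration; ties in similarity are broken by the lowest frame index, which is an arbitrary choice not discussed in the text.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    num = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return num / (na * nb) if na > 0 and nb > 0 else 0.0


def recover_order(features, first, second):
    """Greedily order jumbled frames given the indices of the first two frames.

    At each step the query is the average of the last two recovered frames,
    and the remaining frame closest to it in cosine similarity is appended.
    """
    order = [first, second]
    remaining = set(range(len(features))) - {first, second}
    while remaining:
        a, b = features[order[-2]], features[order[-1]]
        query = [(x + y) / 2.0 for x, y in zip(a, b)]
        nxt = max(sorted(remaining), key=lambda i: cosine(query, features[i]))
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Toy video whose frame features rotate steadily; entries 2 and 3 are swapped.
angles = [0, 10, 30, 20, 40]
feats = [[math.cos(math.radians(d)), math.sin(math.radians(d))] for d in angles]
order = recover_order(feats, 0, 1)
```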



































Figure 8. After querying the Internet for images of the “wedding” event, we cluster them into sub-events and temporally organize the 
clusters using our model. On the left, we show sample images crawled for the “wedding” event, and on the right the temporal order 
recovered by our model is visualized along with manual captions for the clusters. 


Method                      1.4k Videos    1k Videos 
Random chance                  50.00         50.00 
Two-stream [30]                42.05         44.18 
fc6                            42.43         43.33 
fc7                            41.67         43.15 
Our model (pairwise)           42.03         43.72 
Our model (no future)          40.91         42.98 
Our model (no hard neg.)       41.02         41.95 
Our model                      40.41         41.13 

Table 4. Video temporal order recovery results on the MED 11 
event kits evaluated using the Kendall tau distance (normalized to 
0-100). Smaller distance indicates better performance. 1.4k 
Videos refers to the set of videos used in the temporal retrieval 
task, and 1k Videos refers to a further subset with the most 
visually dissimilar frames. 

In order to enable easy comparison across all videos, we 
sample the same number of frames (12) from each video 
before scrambling them for the order recovery problem. An 
example comparing our embeddings to fc7 is shown in Fig. 9. 
Evaluation. We evaluate the performance for solving the 
order recovery problem using the Kendall tau [13] distance 
between the groundtruth sequence of frames and the 
sequence returned by the greedy method. The Kendall tau 
distance is a metric that counts the number of pairwise 
disagreements between two ranked lists; the larger the distance, 
the more dissimilar the lists. The performance of different 
features for this task is shown in Tab. 4, where the Kendall 
tau distance is normalized to be in the range 0-100. 
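
As a rough illustration of the metric, the sketch below counts discordant pairs and rescales the count to 0-100. The exact normalization is not spelled out in the text; dividing by the total number of pairs, n(n-1)/2, is an assumption here, chosen because it gives an expected value of 50 for a uniformly random ordering, in line with the random chance entry in Tab. 4. The function name is hypothetical.

```python
from itertools import combinations

def kendall_tau_distance(predicted, truth):
    """Percentage (0-100) of item pairs that the two orderings disagree on.

    predicted, truth: sequences containing the same unique, hashable items.
    """
    if sorted(predicted) != sorted(truth):
        raise ValueError("both sequences must contain the same items")
    n = len(truth)
    if n < 2:
        return 0.0
    # Position of every item in the predicted ordering.
    rank = {item: pos for pos, item in enumerate(predicted)}
    # combinations() yields (a, b) with a before b in the ground truth;
    # the pair is discordant if the prediction places a after b.
    discordant = sum(1 for a, b in combinations(truth, 2) if rank[a] > rank[b])
    total_pairs = n * (n - 1) / 2
    return 100.0 * discordant / total_pairs
```

Under this normalization, an identical ordering scores 0 and a fully reversed ordering scores 100.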

Similar to the temporal retrieval setting, we use the 
subset of 1396 videos which are at least 90 seconds long. These 
results are reported in the first column of the table. We 
observed that our performance was quite comparable to that 
of fc7 features for videos with visually similar frames like 
those from the “parade” event, as they lack interesting 
temporal structure. Hence, we also report results on the subset 
of 1000 videos which had the most visually distinct frames. 
These results are shown in the second column of the table. 
We also evaluated human performance on this task on a 
random subset of 100 videos, and found the Kendall tau distance to 
be around 42. This is on par with the performance of the 
automatic temporal order produced by our methods, and 
illustrates the difficulty of this task for humans as well. 

We observe that our full context model trained with a 
temporal objective achieves the best Kendall tau distance. 
This improvement is more marked in the case of the 1k 
Videos with more visually distinct frames. This shows the 
ability of our model to bring together sequences of frames 
that should be temporally and semantically coherent. 
Ordering actions on the Internet. Image search on the 
Internet has improved to the point where we can find relevant 
images with textual queries. Here, we wanted to 
investigate whether or not we could also temporally order images 
returned from the Internet for textual queries that involve 
complex events. To do this, we used query expansion on 
the “wedding” query, and crawled Google for a large set 
of images. Then, based on the queries, we grouped the 
images into semantic clusters, and for each cluster, 
averaged our embedding features to obtain a representation 
for the cluster. With this representation, we then used our 
method to recover the temporal ordering of these clusters 
of images. In Fig. 8, we show the temporal ordering 
automatically recovered by our embedded features, and some 
example images from each cluster. Interestingly, the 
recovered order seems consistent with typical wedding scenarios. 
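
The cluster representation step (averaging the embedding features of the images in each cluster) might look like the sketch below. The function name, the array layout, and the use of string cluster labels are illustrative assumptions; how the clusters are subsequently ordered is not detailed in the text, so only the averaging step is shown.

```python
import numpy as np

def cluster_representations(embeddings, labels):
    """Average the embeddings of the images assigned to each cluster.

    embeddings: (n, d) array of image embeddings (hypothetical input).
    labels: length-n sequence of cluster identifiers, one per image.
    Returns (cluster_ids, centroids), where centroids[i] is the mean
    embedding of cluster_ids[i].
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    cluster_ids = sorted(set(labels.tolist()))
    centroids = np.stack(
        [embeddings[labels == c].mean(axis=0) for c in cluster_ids]
    )
    return cluster_ids, centroids
```

Each row of `centroids` can then be treated like a single frame embedding when recovering an order over clusters.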

8. Conclusion 

In this paper, we presented a model to embed video 
frames. We treated videos as sequences of frames and 
embedded them in a way which captures the temporal context 
surrounding them. Our embeddings were learned from a 
large collection of more than 40000 unlabeled videos, and 
have been shown to be more effective for multiple video tasks. 
The learned embeddings performed better than other video 
frame representations for all tasks. The main thrust of our 
work is to push a framework for learning frame-level 
representations from large sets of unlabeled video, which can 
then be used for a wide range of generic video tasks. 